
Conversation

@Dan-Flores (Contributor):
This PR creates a simple VideoEncoder class and updates several tests to use the VideoEncoder.to_file pattern (a sketch of that pattern follows the list below).

  • test_bad_input_parameterized is the VideoEncoder equivalent of the AudioEncoder test that ensures general error checking occurs, while test_bad_input covers method-specific errors.

  • test_contiguity is the VideoEncoder equivalent of the AudioEncoder test that ensures contiguous and non-contiguous tensors can both be encoded, and are encoded equivalently.
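
A minimal sketch of the VideoEncoder.to_file pattern exercised by the updated tests. The import path and exact keyword names are assumptions based on the docstrings discussed below, not verified against the final code:

    import torch
    from torchcodec.encoders import VideoEncoder  # import path assumed

    # 10 RGB frames, uint8, shape (N, C, H, W) as required by the docstring
    frames = torch.randint(0, 256, (10, 3, 64, 64), dtype=torch.uint8)

    encoder = VideoEncoder(frames, frame_rate=30)  # frame rate of the input frames
    encoder.to_file("output.mp4")  # container format inferred from the extension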

@meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) · Oct 21, 2025
@Dan-Flores marked this pull request as ready for review · October 28, 2025 14:35
@NicolasHug (Contributor) left a comment:

Thanks @Dan-Flores, looks great. I made a quick first pass.

frames (``torch.Tensor``): The frames to encode. This must be a 4D
tensor of shape ``(N, C, H, W)`` where N is the number of frames,
C is 3 channels (RGB), H is height, and W is width.
A 3D tensor of shape ``(C, H, W)`` is also accepted as a single RGB frame.
Contributor:
        A 3D tensor of shape ``(C, H, W)`` is also accepted as a single RGB frame.

Q - Do we need to support that? I'm wondering whether it makes much sense to encode a single image as a video. I suspect this was made to mimic the AudioEncoder behavior, but that was a different use-case: in the AudioEncoder we want to allow 1D audio to be supported since it's still a valid waveform. But I don't think we need to treat a single frame as a valid video.

@Dan-Flores (Contributor Author):

I'm not sure what the use case is for encoding an image as a video, but since FFmpeg allows encoding an image to video, I believe we can retain this functionality at a relatively low cost.

@NicolasHug (Contributor) · Oct 29, 2025:

I'm not sure what the use case is for encoding an image as a video

I'm not sure either - let's not do it then :). One principle we have is that we should only implement things that we know are useful. When in doubt, it's best not to expose it; that makes our life easier as maintainers. Of course, if users start to explicitly request this feature, we may reconsider.

In the near future we will start implementing proper image encoders in torchcodec, and another guiding principle is that there should be one-- and preferably only one --obvious way to do it. (ref). If we were to support 3D tensors (images) into the VideoEncoder then we'd potentially expose two ways to do the same thing.

It doesn't mean we'll prevent users from doing it if they really want to: a user can still just call unsqueeze(0) on their 3D tensor and pass it. It will work. But we don't need to explicitly support 3D tensors on our side.
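
For illustration, a hedged sketch of that workaround, assuming the 4D-only API and an import path that may differ:

    import torch
    from torchcodec.encoders import VideoEncoder  # import path assumed

    single_frame = torch.randint(0, 256, (3, 480, 640), dtype=torch.uint8)  # (C, H, W)
    frames = single_frame.unsqueeze(0)  # add a batch dimension -> (1, C, H, W)
    VideoEncoder(frames, frame_rate=1).to_file("one_frame.mp4")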

Comment on lines 19 to 20
frame_rate (int): The frame rate to use when encoding the
**input** ``frames``.
Contributor:

My interpretation of this description is that this parameter actually defines the frame rate of the encoded output, because of the "frame rate to use when encoding" part.

I think it might be less ambiguous as

The frame rate of the input frames. This is not the frame rate of the encoded output.

@Dan-Flores (Contributor Author):

We use this variable in Encoder.cpp: avCodecContext_->framerate = {inFrameRate_, 1};.
Perhaps the naming is confusing (it should be inputFrameRate rather than inFrameRate), but is it not accurate to say this defines the frame rate of the encoded output?

Contributor:

Good point, you're right, frame_rate and inFrameRate currently define both the frame rate of the input and the frame rate of the encoded output.

But in a future implementation, we might let the user choose to encode the output with a different frame rate. If/when that happens, this frame_rate parameter will only be about the input frame rate, while the frame_rate parameter that we'll add to the encoding methods (to_file(frame_rate=...)) will define the output frame rate. It will be similar to what we do in the AudioEncoder with the sample_rate parameters.

My original docstring suggestion was definitely wrong (thanks for catching!), but I'd still suggest clarifying that this is really the input frame rate and that, for now, it also happens to be what we use for the output frame rate:

The frame rate of the input frames. Also defines the encoded output frame rate.
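
As a hedged illustration of the AudioEncoder analogy mentioned above (the signatures below are assumptions, not checked against the current API): the constructor parameter describes the input, while the encoding method can request a different output rate.

    import torch
    from torchcodec.encoders import AudioEncoder  # import path assumed

    samples = torch.rand(2, 16_000) * 2 - 1              # 1 second of stereo audio in [-1, 1]
    encoder = AudioEncoder(samples, sample_rate=16_000)  # sample rate of the *input* samples
    encoder.to_file("out.wav", sample_rate=44_100)       # requested *output* sample rate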

@Dan-Flores (Contributor Author) · Oct 29, 2025:

Thanks for clarifying this - I'll make a note to add frame_rate to the encoding methods (to_file, etc.) next.

Args:
dest (str or ``pathlib.Path``): The path to the output file, e.g.
``video.mp4``. The extension of the file determines the video
format and container.
Contributor:

What's the distinction between "format" and "container" here? I would just use one or the other?

@Dan-Flores (Contributor Author):

The terms are used interchangeably, so I included both here to be understandable to users of either term. Let me know if that is actually more confusing than just using one term.

I might have added the distinction after discovering that the mkv format is defined as a Matroska container. I have not encountered other formats where these differ.

@NicolasHug (Contributor) · Oct 29, 2025:

I feel like saying

determines the video format and container.

may suggest to some readers that "format" and "container" refer to different things, especially because a) "format" is sometimes used to refer to the codec, and b) in the methods we have only a single "format" parameter and we never mention the "container".

How about using the term "container format" everywhere in the docstrings? I think that is the actual correct technical term for that (wikipedia):

Notable examples of container formats include [...] formats used for multimedia playback (such as Matroska, MP4, and AVI).

The parameter can still be format - I think this makes sense for consistency with the AudioEncoder, but in the docstrings of all methods we could clarify that format corresponds to the "container format"?
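
For illustration, a small sketch of how the container format follows from the file extension (signatures assumed from this PR's docstrings, not verified):

    import torch
    from torchcodec.encoders import VideoEncoder  # import path assumed

    frames = torch.randint(0, 256, (30, 3, 128, 128), dtype=torch.uint8)
    encoder = VideoEncoder(frames, frame_rate=30)

    encoder.to_file("clip.mp4")  # MP4 container format
    encoder.to_file("clip.mkv")  # Matroska container format (".mkv" extension)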

@Dan-Flores merged commit d63504c into meta-pytorch:main · Oct 29, 2025
59 checks passed
@Dan-Flores deleted the vid_encoder_python branch · October 29, 2025 20:26
